Moving beyond Linearity
We often make a linearity assumption because it makes the model easier to work with. However, the linearity assumption is not always a good approximation, and is sometimes a poor one. To address this, we extend the linear model by transforming the features.
For a univariate feature (p = 1), we have:
- Polynomial regression
- Step function
- Regression spline
For multivariate features (p > 1), we have:
- Local regression
- Generalized additive model
Polynomial Regression
Polynomial Regression models the response as $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_d x_i^d + \epsilon_i$, where $\epsilon_i$ is the error term.
- We can use the OLS approach to fit such a model.
- Coefficients in polynomial regression are not directly interpretable; instead, we interpret the fitted function $\hat{f}(x_0) = \hat\beta_0 + \hat\beta_1 x_0 + \cdots + \hat\beta_d x_0^d$.
- In this class, we will use a low degree $d$ in practice (typically no more than 3 or 4), and $d$ can be chosen via cross-validation.
- We can also use polynomial regression for classification, i.e. the logit (log-odds) in the logistic regression approach can be written as $\log\left(\frac{p(x_i)}{1 - p(x_i)}\right) = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_d x_i^d$, where $p(x_i) = \Pr(y_i = 1 \mid x_i)$.
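Below is a minimal polynomial-regression sketch, assuming synthetic data and an illustrative degree $d = 3$; it builds the polynomial design matrix explicitly and fits it by OLS, as described above.

```python
import numpy as np

# A minimal polynomial-regression sketch on synthetic data (values are illustrative).
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=100)
y = 1.0 + 0.5 * x - 1.5 * x**2 + rng.normal(scale=0.3, size=100)

d = 3  # polynomial degree; in practice chosen via cross-validation
X = np.vander(x, N=d + 1, increasing=True)      # columns: 1, x, x^2, ..., x^d
beta, *_ = np.linalg.lstsq(X, y, rcond=None)    # OLS fit of the polynomial coefficients

# Interpret the fitted function f_hat(x0) rather than the individual coefficients.
x0 = np.linspace(-2, 2, 5)
f_hat = np.vander(x0, N=d + 1, increasing=True) @ beta
print(f_hat)
```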
Step Function
Since the overall structure (or shape) of a polynomial regression fit is globally determined by the degree of X, it is very bad at extrapolation, i.e. it tends to have a large test error for data far away from the training data. To solve this problem, we can use a Step Function to divide the range of X into several bins.
We cut the range of X at points $c_1, c_2, \ldots, c_K$. We then construct $K + 1$ dummy variables that sum to 1 for each observation, denoted $C_0(X), C_1(X), \ldots, C_K(X)$, where $C_0(X) = I(X < c_1)$, $C_1(X) = I(c_1 \le X < c_2)$, \ldots, $C_{K-1}(X) = I(c_{K-1} \le X < c_K)$, $C_K(X) = I(c_K \le X)$, and $I(\cdot)$ is an indicator that equals 1 when the condition holds and 0 otherwise.
Then the Step Function model is $y_i = \beta_0 + \beta_1 C_1(x_i) + \beta_2 C_2(x_i) + \cdots + \beta_K C_K(x_i) + \epsilon_i$, where $\epsilon_i$ is the error term.
- Note we don't need $C_0(X)$ in the model when we also have the intercept term $\beta_0$.
- Again, we can fit this model using the OLS approach.
- We can now interpret the coefficients: $\beta_j$ represents the average difference in the response for $c_j \le X < c_{j+1}$ relative to the baseline bin $X < c_1$.
It seems the step function solves the problem of polynomial regression, but this relies on having a good choice of cut points $c_1, \ldots, c_K$; with a poor choice of cut points, the step function will not capture the trend. A small fitting sketch follows.
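Here is a minimal step-function sketch, assuming illustrative cut points and synthetic data; it builds the dummy variables (dropping $C_0$ because of the intercept) and fits by OLS.

```python
import numpy as np

# A minimal step-function sketch (cut points and data are illustrative).
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = np.where(x < 5, 2.0, 5.0) + rng.normal(scale=0.5, size=200)

cuts = np.array([2.5, 5.0, 7.5])                 # cut points c_1, ..., c_K
bins = np.digitize(x, cuts)                      # bin index 0..K for each observation

# Dummy variables C_1(x), ..., C_K(x); C_0 is dropped because we keep an intercept.
C = (bins[:, None] == np.arange(1, len(cuts) + 1)).astype(float)
X = np.column_stack([np.ones_like(x), C])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # OLS fit

# beta[j] is the average difference in y for bin j relative to the baseline bin (x < c_1).
print(beta)
```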
Regression Spline
We can use an idea similar to the step function to solve this problem: we use the cut points to fit a different polynomial in each region. Here we also call those cut points knots. Knots are typically placed at certain quantiles of the data or equally spaced over the range of X. More knots lead to a more flexible piecewise polynomial. For better performance and interpretation, we add constraints on the polynomials at the knots:
- Continuity: equal function values
- Smoothness: equal first and second derivatives
- Higher-order smoothness: equal higher-order derivatives (if desired)
If we constrain those piecewise polynomials in this way, we call them Splines. A degree-d spline consists of piecewise degree-d polynomials with continuous derivatives up to degree (d - 1) at each knot.
We have a basis representation for the Regression Spline model: $y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \cdots + \beta_K b_K(x_i) + \epsilon_i$, where $\epsilon_i$ is the error term and $b_k(\cdot)$ is the $k$th basis function. Polynomial regression and step functions are special cases of this basis-function approach.
Some common splines:
Linear spline: a piecewise linear function that is continuous at each knot. Define knots $\xi_1, \ldots, \xi_K$, and write the model with those knots as $y_i = \beta_0 + \beta_1 b_1(x_i) + \cdots + \beta_{K+1} b_{K+1}(x_i) + \epsilon_i$, where $\epsilon_i$ is the error term.
- here $b_1(x_i) = x_i$ and $b_{k+1}(x_i) = (x_i - \xi_k)_+$ for $k = 1, \ldots, K$, where $(x_i - \xi_k)_+ = \max(0, x_i - \xi_k)$.
- here the slope $\beta_1 + \beta_2 + \cdots + \beta_{k+1}$ gives the average increase of Y associated with a one-unit increase of X for $\xi_k \le X < \xi_{k+1}$.
Cubic spline: a piecewise cubic function with continuous derivatives up to order 2 at each knot. Define knots $\xi_1, \ldots, \xi_K$, and write the model with those knots as $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \beta_4 (x_i - \xi_1)_+^3 + \cdots + \beta_{K+3} (x_i - \xi_K)_+^3 + \epsilon_i$, where $(x_i - \xi_k)_+^3 = \max(0, x_i - \xi_k)^3$ and $\epsilon_i$ is the error term.
Natural Spline: a regression spline with additional boundary constraints: the function is required to be linear at the boundary (i.e. where X is smaller than the smallest knot or larger than the largest knot).
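Below is a minimal cubic regression spline sketch using the truncated power basis described above; the knot placement (quantiles) and data are illustrative assumptions, and the natural-spline boundary constraints are not imposed here.

```python
import numpy as np

# A minimal cubic regression spline sketch with the truncated power basis
# (knot locations and data are illustrative assumptions).
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.2, size=200)

knots = np.quantile(x, [0.25, 0.5, 0.75])        # place knots at quantiles of x

def cubic_spline_basis(x, knots):
    """Columns: 1, x, x^2, x^3, (x - xi_1)_+^3, ..., (x - xi_K)_+^3."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - xi, 0, None) ** 3 for xi in knots]
    return np.column_stack(cols)

X = cubic_spline_basis(x, knots)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # OLS fit of the spline coefficients

# Evaluate the fitted spline on a small grid.
grid = np.linspace(0, 10, 5)
print(cubic_spline_basis(grid, knots) @ beta)
```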
Local Regression
Local regression predicts at a target point $x_0$ using only the nearby training observations.
We design the algorithm as:
- Gather the fraction $s = k/n$ of the training observations whose $x_i$ are closest to $x_0$.
- Assign a weight $K_{i0} = K(x_i, x_0)$ to each point in this neighborhood so that the point furthest from $x_0$ has weight zero, and the closest has the highest weight. All but these $k$ nearest neighbors get weight zero.
- Fit a weighted least squares regression of the $y_i$ on the $x_i$ using the aforementioned weights, by finding $\hat\beta_0$ and $\hat\beta_1$ that minimize $\sum_{i=1}^n K_{i0}(y_i - \beta_0 - \beta_1 x_i)^2$.
- The fitted value at $x_0$ is $\hat f(x_0) = \hat\beta_0 + \hat\beta_1 x_0$.
Some notices for local regression algorithm:
- the span (or bandwidth) $s$ is a tuning parameter that controls the number of neighbors used in the local regression, and we can use cross-validation to choose it.
- the weight of each point in the neighborhood needs to be specified.
- KNN (k-nearest neighbors) is one of the most common local regression approaches. It corresponds to a constant weight function $K_{i0} = 1/k$ over the $k$ nearest neighbors and fitting only the constant term $\beta_0$, so $\hat f(x_0)$ is the average of the neighbors' responses.
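The sketch below implements the local regression algorithm above at a single target point, assuming a tri-cube weight function (a common but not mandatory choice) and an illustrative span of 0.3 on synthetic data.

```python
import numpy as np

# A minimal local (linear) regression sketch at a single target point x0
# (span, weight function, and data are illustrative assumptions).
rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 10, size=200))
y = np.sin(x) + rng.normal(scale=0.2, size=200)

def local_linear_fit(x, y, x0, span=0.3):
    k = int(np.ceil(span * len(x)))              # number of neighbors from the span s = k/n
    dist = np.abs(x - x0)
    idx = np.argsort(dist)[:k]                   # the k nearest neighbors of x0
    u = dist[idx] / dist[idx].max()              # scaled distances in [0, 1]
    w = (1 - u**3) ** 3                          # tri-cube weights; furthest point gets weight 0
    X = np.column_stack([np.ones(k), x[idx]])
    W = np.diag(w)
    # Weighted least squares: minimize sum_i w_i (y_i - b0 - b1 x_i)^2
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y[idx])
    return beta[0] + beta[1] * x0                # fitted value at x0

print(local_linear_fit(x, y, x0=5.0))
```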
KNN
The KNN algorithm is designed as:
- Find the $k$ training observations that are closest to the test instance $x^*$.
- The classification output is the majority class among those $k$ neighbors' labels.
We use a Voronoi diagram to visualize the behavior of the classifier; the boundary between regions of the feature space assigned to different classes is the decision boundary.
KNN is sensitive to noise and mislabeled data. It is also computationally expensive, since we must compute the distance between each test instance and every training instance. It does not work well in high dimensions, where all points end up at approximately the same distance. That is, the neighborhood structure depends on the intrinsic dimensionality of the data, and low intrinsic dimension is the saving grace.
We also have tradeoffs in choosing $k$:
- A small $k$ captures fine-grained patterns and gives a more flexible decision boundary, but has high variance and may be sensitive to random idiosyncrasies in the training data.
- A large $k$ gives a less flexible decision boundary with lower variance, but may fail to capture important regularities.
- Balancing these, the optimal choice of $k$ depends on the number of data points $n$, with nice theoretical properties if $k \to \infty$ and $k/n \to 0$ as $n \to \infty$. In practice, $k$ can be chosen via cross-validation.
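A minimal KNN classification sketch follows, using synthetic two-class data and an illustrative $k = 5$; in practice $k$ would be chosen via cross-validation as noted above.

```python
import numpy as np

# A minimal KNN classification sketch on synthetic 2-D data (values are illustrative).
rng = np.random.default_rng(4)
X_train = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)

def knn_predict(X_train, y_train, x_star, k=5):
    dist = np.linalg.norm(X_train - x_star, axis=1)   # distance to every training point
    neighbors = np.argsort(dist)[:k]                  # indices of the k nearest neighbors
    votes = np.bincount(y_train[neighbors])           # majority vote among their labels
    return np.argmax(votes)

print(knn_predict(X_train, y_train, x_star=np.array([1.5, 1.5]), k=5))
```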
Generalized Additive Models (GAMs)
GAMs provide a general framework for extending a standard linear model by allowing a non-linear function of each of the variables while maintaining additivity: $y_i = \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + \cdots + f_p(x_{ip}) + \epsilon_i$, where $\epsilon_i$ is the error term.
- Each $f_j$ can be linear, a polynomial, a step function, a spline, or a local regression.
- Can be used for both regression and classification problems.
- e.g. logistic regression: $\text{logit}(p(X)) = \log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + f_1(X_1) + f_2(X_2) + \cdots + f_p(X_p)$
- Allowing a non-linear $f_j$ for each $X_j$ lets us capture non-linear relationships between $Y$ and each $X_j$.
- Due to additivity, we can interpret the effect of each $X_j$ on $Y$ while holding all the other variables fixed.
- But the additive form fails to capture interactions between $X_j$ and $X_k$.
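The sketch below illustrates the additive structure only: each feature gets its own non-linear basis (here a cubic polynomial, chosen for simplicity), the bases are concatenated, and the whole model is fit jointly by OLS. This is an unpenalized simplification on synthetic data, not a full GAM fitting procedure (which typically uses penalized splines or backfitting).

```python
import numpy as np

# A minimal additive-model sketch: one non-linear basis per feature, fit jointly by OLS
# (data and the cubic-polynomial basis are illustrative assumptions).
rng = np.random.default_rng(5)
n = 300
x1 = rng.uniform(-2, 2, n)
x2 = rng.uniform(-2, 2, n)
y = np.sin(x1) + x2**2 + rng.normal(scale=0.3, size=n)

def poly_basis(x, degree=3):
    """Columns x, x^2, ..., x^degree (no intercept; a single intercept is added below)."""
    return np.column_stack([x**d for d in range(1, degree + 1)])

X = np.column_stack([np.ones(n), poly_basis(x1), poly_basis(x2)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Thanks to additivity, the estimated effect of x1 on y can be read off from its own
# basis columns while the other feature's contribution is held fixed.
f1_hat = poly_basis(np.linspace(-2, 2, 5)) @ beta[1:4]
print(f1_hat)
```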